Welcome to the aRt of the figure: visualising your data using R. This file contains a fairly large number of exercises and examples. We will only be using a subset of them in this 2-hour workshop. The choice of exercises will depend on the level of the participants. If you are completely new to R, worry not, there will be a short intro to R in the beginning to bring everyone up to speed (there is also an introductory exercise for beginners in the preparation tutorial which I hope you completed). Of course you are welcome to work through the rest in your own time and make use of any example code for your own work.
This section contains some basic FAQ and tips. It’s here at the top so that if you get stuck or confused, you can easily find it.
Getting help for a function: use help(functionname), or search for the function by name in the Help tab on the right. Arguments have names, but the names can be omitted if the arguments are given in their intended order. If plots appear inline in the script instead of in the Plots panel on the right: Tools -> Global Options -> R Markdown -> untick “Show plots inline…”
This indicates the package is not loaded. Use the relevant library() command to load the package that includes the missing function. There are library("package") calls at the beginning of each section that requires them. You only need to load a package once per session, but they are there anyway to keep the script modular for easier revisiting.
You should have installed the necessary packages before the start of the workshop. If you did not (indicated by library() giving you a "package not found" error), here are the relevant installation commands.
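A sketch of those commands, covering the packages loaded later in this file (each package only needs to be installed once; run just the ones you are missing):

```r
# Installation commands for the packages used in this workshop.
# After installing, load a package with library() at the start of each session.
install.packages("ggplot2")
install.packages("wordcloud")
install.packages("igraph")
install.packages("visNetwork")
install.packages("quanteda")
install.packages("plotly")
```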
Let’s get started by running our first R command (well maybe not first ever if you’ve used R before).
# This is a code block, distinguishable by the gray shaded background.
# This is a line of code:
print( "Hello! Put your text cursor on this line (click on the line). Anywhere on the line. Now press CTRL+ENTER (PC) or CMD+ENTER (Mac). Just do it." )
# The command above, when executed (what you just did), printed the text in the console below. Also, this here is a comment. Commented parts of the script (anything after a # ) are not executed. This R Markdown file has both code blocks (gray background) and regular text (white background).
(Also, if you’ve been scrolling left and right in the script window to read the code, turn on text wrapping ASAP: on the menu bar above, go to Tools -> Global Options -> Code (tab on the left) -> tick “Soft-wrap R source files”)
So, print() is a function. Most functions look something like this:
myfunction(inputs, parameters)
All the inputs to the function go inside the ( ) brackets, separated by commas. In the above case, the text is the input to the print() function. All text, or “strings”, must be within quotes. Note that commands may be nested; in this case, the innermost are evaluated first:
function2( function1(do, something), parameters_for_function2 )
Don’t worry if that’s all a bit confusing for now. Let’s try another function, sum():
sum(1,10) # cursor on the line, press CTRL+ENTER (or CMD+ENTER on Mac)
# You should see the output (sum of 1 and 10) in the console.
# Important: you can always get help for a function and check its input parameters by executing
help(sum) # put the name of any function in the brackets
# ...or by searching for the function by name in the Help tab on the right.
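To see the nesting from above in action, here is a small illustration (not part of the exercises): nchar() counts the characters in a string, and its output becomes one of the inputs to sum(). The inner command is evaluated first.

```r
nchar("hello")          # counts the characters in the string: 5
sum(nchar("hello"), 10) # the inner nchar() is evaluated first, so this is sum(5, 10): 15
```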
# Exercise. You can also write commands directly in the console, and execute them with ENTER. Try some more simple maths - math in R can also be written using regular math symbols (which are really also functions). Write 2*3+1 in the console below, and press ENTER.
# Let's plot something. The command for plotting is, surprisingly, plot().
# It (often) automatically adapts to the data type (you'll see how soon enough).
plot(42, main = "The greatest plot in the world") # execute the command; a plot should appear on the right.
# OK, that was not very exciting. But notice that a function can have multiple inputs, or arguments. In this case, the first argument is the data (a vector of length one), and the second is 'main', which specifies the main title of the plot.
# You can make the plot bigger by pressing the 'Zoom' button above the plot panel on the right.
# Let's create some data to play with. We'll use the sample() command, which draws random numbers from a predefined sample. Basically it's like rolling a die some n times and recording the results.
sample(x = 1:6, size = 50, replace = T) # execute this; its output is 50 numbers
# If an output is not assigned to some object, it usually just gets printed in the console. It would be easier to work with the data if we saved it in an object. For this, we need to learn assignment, which in R works using the equals = symbol (or the <-, but let's stick with = for simplicity).
dice = sample(x = 1:6, size = 50, replace = T) # what it means: dice is the name of a (new) object, the equals sign (=) signifies assignment, with the object on the left and the data on the right. In this case, the data is the output of the sample() function. Instead of being printed in the console, the output is assigned to the object.
dice # execute to inspect: calling an object usually prints its contents into the console below.
# Let's plot:
hist(dice, breaks=20, main="Frequency of dice values") # plots a histogram (distribution of values)
plot(dice) # plots data as it is ordered in the object
xmean = mean(dice) # calculate the mean of the 50 dice throws
abline(h = xmean, lwd=3) # plot the mean as a horizontal line
# Exercise: compare this plot with your neighbor. Do they look the same? Why/why not?
# Exercise: use the sample() function to simulate 25 throws of an 8-sided DnD dice.
Numerical values include things we can measure on a continuous scale (height, weight, reaction time), things that can be ordered (“rate this on a scale of 1-5”), and things that have been counted (number of participants in an experiment, number of words in a text).
We will use a built-in classic dataset called “iris” - it contains information about a bunch of flowers.
data("iris") # load the data into the workspace (or "global environment").
# We can also inspect the data using R commands.
head(iris) # prints the first rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris) # produces an automatic summary of the columns
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# In RStudio, you can also have a look at the dataframe by clicking on the little "table" icon next to it in the Environment section (top right).
help(iris) # built in datasets often have help files attached
# Plotting time! Let's see for example how long the petals are in the dataset
iris$Petal.Length # the $ is used for accessing a (named) column of a dataframe
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
## [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
## [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
## [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
## [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
## [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
iris[, "Petal.Length"] # this is the other indexing notation: [row, column]
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
## [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
## [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
## [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
## [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
## [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
plot(iris$Petal.Length) # two observations: there is quite a bit of variation, and it seems there are clusters in the data
hist(iris$Petal.Length, breaks=10) # a histogram shows the distribution of values ('breaks' change resolution)
boxplot(iris$Petal.Length) # a boxplot is like a visual summary()
points(x=rep(1, nrow(iris)), y=iris$Petal.Length) # could also add actual datapoints
Exercise. Make a new code block here for the exercises (insert… on the toolbar above). An easy way to deal with overlapping points is to add noise in the dimension that is not informative anyway. Copy the boxplot and points commands from above, and modify the points command by wrapping the rep(1, nrow(iris)) bit inside a jitter() function, so it looks something like: points(x=jitter(rep...
Make sure the brackets match. If you are unsure at any time what a command inside another command does, run it separately and see what happens. Remember to think about what the input is, and what the output is.
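If jitter() is new to you, it is exactly the kind of command worth running separately first. It adds a small amount of random noise to numbers (so your exact output will differ from anyone else's):

```r
rep(1, 5)         # five identical values: 1 1 1 1 1
jitter(rep(1, 5)) # the same five values, each nudged by a little random noise
```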
Exercise. Here’s something else to try: the default color of the points is black. Change it to something else by adding the parameter col to the points command (remember, parameters are separated by commas, and they are given values using the = sign; color names must be in quotes, e.g., “darkred”).
# Another way to plot boxplots, grouping them by some relevant variable:
boxplot(iris$Petal.Length ~ iris$Species) # note the ~ notation
grid() # why not add a grid for reference
# A slightly nicer version:
boxplot(iris$Petal.Length ~ iris$Species, ylab="petal length",
border=c("plum3", "darkblue", "lightblue"), boxwex=0.7, cex=0.4)
abline(h=1:7, col=rgb(0,0,0,0.1)) # adds horizontal lines instead of a full grid
The rgb(red, green, blue, alpha) function allows making custom colors; alpha controls transparency. Possible values range between 0 and 1 by default. Below is a piece of code that generates an example of how the color scheme works (don’t worry if you don’t understand the actual code, this is above the level of this workshop; just put the cursor in the code block and press CTRL+SHIFT+ENTER (CMD+SHIFT+ENTER on Mac).
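Here is a minimal sketch of such a demo: it plots rows of red, green and blue squares, with alpha increasing from left to right.

```r
# rgb() demo: three pure colors, transparency (alpha) increasing left to right
plot(NULL, xlim=c(0, 1.1), ylim=c(0.5, 3.5), xlab="alpha", ylab="", yaxt="n",
     main="rgb(red, green, blue, alpha)")
for(a in seq(0.1, 1, 0.1)){
  points(a, 3, pch=15, cex=4, col=rgb(1, 0, 0, a)) # pure red at alpha a
  points(a, 2, pch=15, cex=4, col=rgb(0, 1, 0, a)) # pure green
  points(a, 1, pch=15, cex=4, col=rgb(0, 0, 1, a)) # pure blue
}
axis(2, at=3:1, labels=c("red", "green", "blue"), las=2) # label the rows
```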
plot(iris$Sepal.Length, iris$Sepal.Width) # no interaction?
# Why not color-code by species. Here we make use of both styles of indexing.
iriscolors= c(rgb(0.2,0,0.3, 0.5), rgb(0,0,0.6,0.5), rgb(0,0.5,0.8,0.6)) # transparent colors
plot(iris$Sepal.Length, iris$Sepal.Width,
col=iriscolors[iris$Species], pch=20) # pch sets the point type
grid(col=rgb(0,0,0,0.3), lty=3)
# This is suddenly a lot of code out of nowhere... If some of it looks overwhelming, worry not! Everything will become clear once you get into R a bit more.
# Add some detail to make this legible to the colour-blind, and printable in black-and-white
irispoints = c(15,20,17) # see help(points) for more
plot(iris$Sepal.Length, iris$Sepal.Width, col = iriscolors[iris$Species] , pch = irispoints[iris$Species])
# Exercise. Make this publication-ready by adding proper labels and a legend
# Modify the code below by adding a suitable title (the parameter is called 'main'; assign it a value - remember, text must be in quotes) and axis labels (parameters xlab and ylab for the respective axes).
plot(iris$Sepal.Length, iris$Sepal.Width,
col=iriscolors[iris$Species] , pch=irispoints[iris$Species] )
#
grid(col=rgb(0, 0, 0, 0.2)) # this adds a grid
legend("topleft", pch=irispoints, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n") # this adds a legend
While they are a whole subject on their own, we will have a quick look at plotting time series - data reflecting changes in some variable over time.
# This time we'll generate some random data and pretend it's real data.
# "The following data are reaction times to stimuli of one individual, over 100 trials, in an experiment on...whatever"
retime = c(runif(20, 0,0.1), seq(0.1, 2.8,length.out = 80)) * runif(100, 0.7, 1.1 )
# Have a look at the raw data first! (by now you already know how to do it)
# Now let's plot it
plot(retime, ylab="reaction time") # this plots points though
What can you tell by the looks of the data?
Exercise: improve this plot by adding the type parameter and setting its value as “l” (which stands for ‘line’, which is more useful in this instance), and set the X axis label (xlab) to say “trials” instead of the default “Index”.
Categorical/nominal/discrete values cannot be put on a continuous scale or ordered, and include things like binary values (student vs non-student) and all sorts of labels (noun, verb, adjective). Words in a text could be viewed as categorical data.
Here is another artificial dataset. Let’s pretend I went around Edinburgh and asked random people on the street the following question: “A new species of insect was recently discovered in Scotland, and they called it Boubicus Boubasus - or Bouba for short. What’s your intuition, is Bouba a big fat bug, or a small slim bug?” (and the same for Kikis Kikosius, or Kiki for short)
boubakiki = data.frame(
meanings=c(
sample(c("big", "small"),25,T, prob=c(0.8,0.2)),
sample(c("big", "small"),24,T, prob=c(0.3,0.7))
),
words=c(rep("bouba",25), rep("kiki",24))
) # this command will create the random data
# Have a look at the raw data first.
# In addition to eyeballing the data, use the following commands: nrow(), dim()
# Now let's use the table() function to make sense of it:
bktable = table(boubakiki)
bktable
## words
## meanings bouba kiki
## big 22 7
## small 3 17
mosaicplot(bktable, col=c("orange", "navy")) # a simple mosaic plot, displays proportions
barplot(bktable, ylab="big small") # a barplot, displays counts
# The library() command loads a package from your library of packages. This needs to be done once per session, i.e. again if you restart R.
library("wordcloud")
## Loading required package: RColorBrewer
# Let's create an object with a bunch of text:
sometext = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort. It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats—the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill — The Hill, as all the people for many miles round called it — and many little round doors opened out of it, first on one side and then on another. No going upstairs for the hobbit: bedrooms, bathrooms, cellars, pantries (lots of these), wardrobes (he had whole rooms devoted to clothes), kitchens, dining-rooms, all were on the same floor, and indeed on the same passage. The best rooms were all on the left-hand side (going in), for these were the only ones to have windows, deep-set round windows looking over his garden, and meadows beyond, sloping down to the river."
# Now let's do some very basic preprocessing to be able to work with the words in the text:
clean = gsub("[[:punct:]]", "", sometext) # remove punctuation (that weird thing inside the gsub (R's find-and-replace command) is a regular expression; don't ask, it just works)
cleanlow = tolower(clean) # make everything lowercase
words = strsplit(cleanlow, split=" ")[[1]]
# Inspect the object we just created. It should be a vector of 232 words.
# Some ways to inspect and visualize textual data
sortedwords = sort(table(words)) # counts the words and sorts them
# Exercise: have a look at the data using the head() and tail() commands
plot(sortedwords, xaxt="n")
axis(1, 1:length(sortedwords), names(sortedwords), las=2, cex.axis=0.5) # add the words
# Time to use the wordcloud package we loaded earlier.
# If you get an error saying 'could not find function "wordcloud"', then you need to load the package (with the library command above).
wordfreqs = as.numeric(sortedwords) # get the frequencies from the table object
wordcloud(words = names(sortedwords), freq=wordfreqs, min.freq = 0)
# Note: if R gives you errors (saying word x could not fit), ignore them. Also, if plots look strange after using wordcloud, use the dev.off() command to reset graphics.
We have now seen how to visualize the most common types of data using R’s basic plotting tools. Before diving into various other things like networks and maps, here are a few examples of using an alternative plotting package, ggplot2, to do similar things. ggplot2 uses a different approach to plotting, and a different syntax (just to confuse you a bit more). ggplot2 also offers default colors and aesthetics which some people find nicer than those of the base plot() (while others don’t).
library("ggplot2")
# Scatterplot of two numeric variables
ggplot(iris, aes(x=Petal.Length, y=Sepal.Length)) + geom_point(aes(col=Species)) +theme(legend.position="top")
# the data are defined in the ggplot command, the + adds layers, themes and other options
# try adding scale_colour_brewer(palette = "Dark2") or geom_smooth(method="lm", aes(col=Species))
# remove or move the legend using theme(), specify legend.position parameter with value "none", "top", etc.
# Boxplots
ggplot(iris, aes(x=Species, y=Petal.Length)) + geom_boxplot()
# try adding geom_point() to add actual data points, and specify col=rgb(0,0,0,0.1) in the geom_point(); if you do that, set outlier.colour = NA in the geom_boxplot(), to suppress the outlier points
# try adding aes(fill=Species) to color by categories
# Time series. Note that unlike the very flexible plot(), ggplot() expects a data frame, a simple numeric vector will not work.
series=data.frame(x=sort(runif(100))+runif(100,-0.1,0.1))
ggplot(data = series, aes(x = 1:100, y = x)) +
geom_line(color = "turquoise", size = 2) +
theme_minimal() +
NULL
# try adding x and y axis labels and title using e.g. labs(x = "time") or xlab(), and ggtitle()
# try changing the theme, e.g. theme_dark()
# a nice trick is to add NULL in the end of the ggplot call (on a new line) - this way, in case you remove the last option (e.g, the theme_minimal in this example), you don't need to delete the + on the last line (since NULL does nothing).
By the way, the plotly package we’ll use below gets along with ggplot2 very nicely, and you can convert plots created using the latter into interactive ones using the ggplotly function.
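For example (assuming both packages are installed), the conversion is a one-liner:

```r
library("ggplot2")
library("plotly")
# A regular static ggplot2 scatterplot of the iris data...
g = ggplot(iris, aes(x=Petal.Length, y=Sepal.Length, col=Species)) + geom_point()
# ...turned into an interactive plot: zoomable, with hover tooltips
ggplotly(g)
```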
Time for a hands-on coding exercise. We’ve now seen how to create networks using R, but the data we used was not particularly interesting (well, it was fake). So far we’ve been using built in datasets and the aforementioned made-up data, all neat and clean. However, to successfully visualize data in real life, you often need to operationalize and clean it first.
The following exercise is about coding a messy database of quotes by musicians about other musicians, and then visualizing it as a nice network. The dataset is located here: https://goo.gl/5urKz9
Exercise. Work in small groups and transform the data in the “quotes” sheet into a machine-readable format in the “cleandata” sheet, in the following format, concatenating musician names by an underscore. Case does not matter, it will be lowercased when we import the data. While you’re at it, code the sentiment of the quote, either pos (positive) or neg (negative). Include the actual quote as well (if it seems too long, include only what feels relevant).
quoting ...... quoted ........ sentiment ... quote
john_lennon .. chuck_berry ... pos ......... whatever he said
Use an intermediary worksheet to clean the data (e.g. add a temporary sheet to the same Google doc, use Excel, or Numbers on Mac, or just a plain text file - in that case, separate the columns by a TAB). Once done, paste the block of data into the cleandata sheet. Note that one quote may well create multiple entries, if multiple names are mentioned.
This time we’ll need to import the data from the Google sheet. We are going to do this in the following way (you can also export a TSV from Google Sheets and move it to the working dir if you want).
Save the cleaned data in your working directory as a plain text file named quotes.txt (on Windows: right click -> New -> text file; on Mac open up TextEdit or similar and save it to that folder; if you use TextEdit, make sure Rich Text mode is off).
# This line imports the data. We specify that the file has a header (column labels), that quotation marks should not be treated specially, and that columns are separated by a tab.
quotes = read.table(file = "quotes.txt", header=T, quote = "", sep="\t")
# Lowercase the names:
quotes$quoting = tolower(quotes$quoting)
quotes$quoted = tolower(quotes$quoted)
# If you get the error "incomplete final line found by readTableHeader on 'quotes.txt'", a quick fix would be to go to the file and press enter at the end of the file to create an empty row.
# Exercise. Check if the data looks ok using head() and summary(). What's the general sentiment?
# Exercise. Since many people coded the data, and the data is messy, some names might be spelled differently - this would lead a person to be analyzed as two separate people below. Let's use Levenshtein edit distance to have a quick look.
everybody = unique(unlist(quotes[,1:2])) # this extracts all the names from the 2 columns
lev = adist(everybody) # Levenshtein edit distance matrix
diag(lev) = NA # remove self-similarity
dimnames(lev) = list(everybody,everybody) # add actual names to the matrix
# Inspect the object. If the matrix is too big, use subsetting.
# This extracts the indices of the most similar name pair:
which(lev == min(lev, na.rm=T), arr.ind = TRUE)
# We can use these indices to compare the names in the names vector: replace 0,0 with what you found:
everybody[c(0, 0)]
# If this is the same person, it might be worthwhile to homogenize the spelling. Give the temporary object x the old name that you want to replace and the object y the name you want to replace it with. Then run the lines below. If not, don't.
x = "oldname"
y = "newname"
quotes$quoting[which(quotes$quoting == x)] = y
quotes$quoted[which(quotes$quoted == x)] = y
# If something blows up just import the clean file again and start over.
We’ll use the handy igraph graph constructor function and then convert it to a visNetwork object, all in one nested call.
library("visNetwork")
library("igraph")
vg = toVisNetworkData(graph_from_edgelist(as.matrix(quotes[,1:2]), directed=T) )
vg$nodes$size = 10
vg$edges$color = ifelse(quotes$sentiment == "pos", "darkgreen", "darkred") # conditional colors
vg$edges$title = quotes$quote
vg$edges$arrows = 'to'
# plot it:
visNetwork(vg$nodes, vg$edges)
# Exercise. Explore the network. Discuss with your neighbor.
# Bonus Exercise. There is not much more we can do with that dataset, but one (fairly absurd) hypothesis springs to mind. We have information on the names of the musicians and the sentiment of their quotes. Why not see if there is any correlation between name length and sentiment (as said, it's not the best research hypothesis in the world, but this is just an exercise in using another text analysis function).
# Use the following snippets to figure it out:
nchar("word") # counts the characters in a string
table(quotes$quoting, quotes$sentiment) # tabulates the variables
rowSums() # sums across rows
# Remember, / is for division and [,] or $ can be used for indexing; you'll want the proportion of (say, negative) quotes divided by the sum of both negative and positive quotes
# The goal is to get two vectors to compare/correlate/plot, one for the lengths of names and the other for the proportion of negative quotes of each musician.
# Bonus technical detail on R Markdown: I've set the eval=FALSE option on these chunks because by default R Markdown would not find the quotes.txt file like this when rendering the file into html later. The working dir of the R session and that of the markdown rendering process are different (respectively, they are whatever getwd() tells you, and the location of the Rmd file). This could be solved by using full file paths, but this is another layer of complication (if you are not used to thinking in paths) that we'll avoid at this point.
In the following examples, we’ll employ some light corpus analysis tools to visualize the content of the inaugural speeches of US presidents. We’ll start by looking into which presidents mention or address other presidents in their speeches. This is similar to the last exercise, but this time we’ll extract the mentions programmatically rather than hand-coding them.
library("quanteda", quietly=T, warn.conflicts=F) # make sure this is installed and load it; this also includes a dataset
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library("igraph", quietly=T, warn.conflicts=F)
library("visNetwork", quietly=T, warn.conflicts=F)
speeches = data_corpus_inaugural$documents$texts # extract speeches data from the internal object
speeches = gsub("Washington DC", "DC", speeches) # replace city name to avoid confusion with president Washington (hopefully)
speechgivers = data_corpus_inaugural$documents$President # names of presidents giving the speech
presidents = unique(data_corpus_inaugural$documents$President) # presidents (some were elected more than once)
# Exercise: have a look at speech number 58, and check who's giving the speech. Hint: use the bracket [] notation
# The following piece of code looks for names of presidents in the speeches using grep(). Just run this little block:
mentions = matrix(0, ncol=length(presidents), nrow=length(presidents), dimnames=list(presidents, presidents))
for(president in presidents){
foundmentions = grep(president, speeches)
mentions[speechgivers[foundmentions], president ] = 1
}
# Note: this is not perfect - the code above concatenates mentions of multiple speeches by the same re-elected president, "Bush" refers to two different people, and other presidents might share names with other people as well. You can check the context of keywords using quanteda's kwic() command:
kwic(data_corpus_inaugural, "Monroe")
##
## [1885-Cleveland, 1202] It is the policy of | Monroe |
## [1909-Taft, 1784] bears the name of President | Monroe |
## [1925-Coolidge, 494] , and secured by the | Monroe |
##
## and of Washington and Jefferson
## . Our fortifications are yet
## doctrine. The narrow fringe
#
# Have a look at the data and some basic stats:
mentions[30:35, 30:35] # rows: one mentioning; columns: being mentioned
## Carter Reagan Bush Clinton Obama Trump
## Carter 0 0 0 0 0 0
## Reagan 0 0 1 0 0 0
## Bush 1 1 1 1 0 0
## Clinton 0 0 1 0 0 0
## Obama 0 0 1 0 0 0
## Trump 1 0 1 1 1 0
counts = apply(mentions, 2, sum)
barplot(counts, horiz = T, las=1) # number of mentions
# Plotting time
pgraph = graph_from_adjacency_matrix(mentions, mode="directed") # this uses igraph
plot(pgraph, edge.arrow.size=0.4) # basic igraph plot
# this uses visNetwork:
pgraph_v = toVisNetworkData(pgraph )
v = visNetwork(nodes = pgraph_v$nodes, edges = pgraph_v$edges)
v # check how it looks before we add all the fancy stuff
v = visNodes(v, size = 10, shadow=T, font=list(size = 30))
v = visIgraphLayout(v, "layout_in_circle", smooth=T)
v = visEdges(v, arrows = "to", shadow=T, smooth=list(type="discrete"), selectionWidth=5)
v = visOptions(v, highlightNearest = list(enabled = T, hover = T, degree=1, labelOnly=F, algorithm="hierarchical"), nodesIdSelection = T)
v
# Bonus: You may notice that in the visNetwork help files, the examples use the magrittr package's %>% pipe notation. The example here does not, in order to keep things simple, but feel free to explore magrittr, it makes writing sequences such as the one below much more elegant. The same applies to the plotly examples below.
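If you are curious, the pipe passes the output of one expression on as the first input of the next function, so a chain of operations reads left to right instead of inside-out. A tiny illustration (assuming the magrittr package is installed):

```r
library("magrittr")
# Without pipes: nested calls, read inside-out
round(mean(sqrt(c(1, 4, 9))), 2)
# With pipes: the same computation, read left to right; both print 2
c(1, 4, 9) %>% sqrt() %>% mean() %>% round(2)
```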
While we’re at it, let’s try to probe into the contents of the speeches and use some more interactive plotting tools to visualize it.
library("quanteda", quietly = T, warn.conflicts=F) # this needs to be loaded
library("plotly", quietly = T, warn.conflicts=F) # this too
# This block of code will extract the top terms (weighted by TF-IDF) from the most recent speeches and calculate the distance between the speeches based on word usage
termmat = dfm_tfidf(dfm(corpus_subset(data_corpus_inaugural, Year>1990), tolower = T, stem=F,remove=stopwords(), remove_punct=T))
topterms = lapply(topfeatures(termmat, n=10, groups=rownames(termmat)), names)
distmat = dist(termmat) # calculate distances
mds = as.data.frame(cmdscale(distmat,k = 2)) # multidimensional scaling (reduces the distance matrix to 2 dimensions)
mds$tags = paste(names(topterms), sapply(topterms, paste, collapse="<br>"), sep="<br>")
# The following makes use of the plotly package
p = plot_ly(mds,x=~V1,y=~V2, type="scatter", mode = 'markers', hoverinfo = 'text', text = ~tags)
p = add_annotations(p, text = ~rownames(mds), xanchor="left", showarrow = F)
p # closer points mark more similar speeches; hover to see key terms that distinguish the speeches
# A look into the usage of some words across centuries
termmat_prop = dfm_weight(dfm(data_corpus_inaugural, tolower = T, stem=F,remove=stopwords(), remove_punct=T), "prop") # use normalized frequencies
words = c("america", "states", "dream", "hope", "business", "peace", "war", "terror")
p = plot_ly(x=words, y=rownames(termmat_prop), z=round(as.matrix(termmat_prop[,words]),5), type="heatmap", colors = colorRamp(c("white", "orange", "darkred")),showscale = F)
p = layout(p, margin = list(l=130, b=50), paper_bgcolor=rgb(0.99, 0.98, 0.97))
p
# Exercise (easy). Choose some other words! Also try changing the color palette.
# Exercise (a bit harder). We could get a better picture of what has been said by the presidents if we expanded our word search with regular expressions. If you don't know regular expressions:
# 1. Make learning regular expressions your next life goal. Like, seriously.
# 2. For now, just know that ^ stands for the beginning of a string and $ for the end, and . stands for any character. So ^white$ would match "white" but not "whites", and l.rd would match "lord" but also "lard" etc. Define some new search terms; below are some ideas.
words2 = c("america$", "^nation", "^happ", "immigra", "arm[yi]", "^[0-9,.]*$")
# The bit of code below uses grep() to match column names, so unless word boundaries are defined using ^$, any column name that *contains* the search string is also matched ("nation" would match "international"). For each search term, it will find and sum all matching rows.
newmat = round(sapply(words2, function(x) rowSums(termmat_prop[, grep(x, colnames(termmat_prop))])),5)
# You can check which column names would be matched with:
grep("america", colnames(termmat_prop), value=T)
## [1] "american" "america" "americanism" "americans" "americas"
## [6] "america's" "american's"
# Then copy the plotly command from above and substitute the z parameter value with newmat.
Making maps programmatically based on data would come in handy if you worked with demographic data, or dialects, areal sociolinguistics, etc. We will look at two ways of plotting maps in R (there are numerous packages for that, all slightly different).
Here’s a quick look into making interactive maps using the plotly package. Note that plotly is under active development - at the time of writing this, the current dev version on GitHub actually already has better mapping support than the most recent CRAN version (which you installed using install.packages()). If you need this sort of thing in your work, check out leaflet, which is good for working with very detailed (google-maps-scale) maps.
# Let's do Europe.
eur = data.frame(country = c("AUT","BEL","BGR","HRV","CYP","CZE","DNK","EST","FIN","FRA","DEU","GRC","HUN","IRL","ITA","LVA","LTU","LUX","MLT","NLD","POL","PRT","ROU","SVK","SVN","ESP","SWE","GBR", "NOR", "ISL", "RUS", "UKR", "BLR"), value = sample(seq(0,5,0.1),33)) # create some data
# Note this uses plot_geo() instead of plot_ly(); it uses the world map by default, but we'll limit the scope.
plot_geo(eur) %>% add_trace(locations = ~country, mode="none") %>% layout(geo = list(scope="europe")) # zoomable map
# Let's actually use it for something.
# I'll just use the magrittr pipes here because they are handy.
plot_geo(eur) %>%
add_trace(z = ~value, locations = ~country, color = ~value, colors = c("darkred", "lightgreen")) %>%
colorbar(title = "", thickness=10) %>%
layout(geo = list(scope="europe"), title="EurSoc Survey Q233: how manly is the Scottish kilt?", margin=list(l=0,r=0,b=0,t=30)) %>%
add_annotations(x= 1.04, y= 1, text = "like totally", showarrow = F) %>% add_annotations(x= 1.04, y= 0.52, text = "yea about that..", showarrow = F)
Before we finish, a word on R and its packages. It’s all free open-source software, meaning countless people have invested a lot of their own time into making this possible. If you use R, do cite it in your work (use the handy citation() command in R to get an up to date reference, both in plain text and BibTeX). To cite a package, use citation("package name"). You are also absolutely welcome to use any piece of code from this workshop, but in that case I would likewise appreciate a citation:
Karjus, Andres (2018). aRt of the Figure. GitHub repository, https://github.com/andreskarjus/artofthefigure. Bibtex:
@misc{karjus_artofthefigure_2018,
  author = {Karjus, Andres},
  title = {aRt of the Figure},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/andreskarjus/artofthefigure}},
  DOI = {10.5281/zenodo.1213335}
}
That’s it for today. Do play around with these things later when you have time, and look into the bonus sections for extras. If you get stuck, Google is your friend; also, check out www.stackoverflow.com - this site is a goldmine of programming (including R) questions and solutions.
Also, if you are looking for consulting on data analysis and visualization or more workshops, take a look at my website https://andreskarjus.github.io/ . I am available for booking via the Edinburgh Uni PPLS Writing Centre (this service is for PPLS students only though) and sometimes hold workshops on these topics. If you want to stay updated keep an eye on my Twitter @AndresKarjus.
But wait! There’s one more thing to do. Since this is an R Markdown document, we can “knit” it into a nice HTML (or PDF, or Word) report file - it will show both the code and the plots produced by the code. Note that unfortunately this will not work if there are errors in your code - these are marked by the little red x signs in the vertical bar on the left of the script. To knit, click the Knit button (with the little blue ball of yarn) above the script window. If the code is free of errors, an HTML document will appear.
Here are some more things you can try out at home later.
Small note: if you try knitting the R Markdown file again later and would like to see output from the bonus sections, set eval=TRUE in these blocks, which will allow them to be rendered (all bonus blocks currently have the eval parameter set to FALSE). You might have also noticed the echo=F parameter - this just means the code itself will not be shown in the knit output (even though it is executed).
We took a look at plotly earlier, which makes interactive plots. These also work in web pages (like the html file you could create by knitting this script file; R Markdown can also be used to create slides, meaning you could easily include interactive graphs in your next presentation). plotly can be used to create a plethora of different plots, including 3D ones:
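As a quick taste, here is a minimal 3D surface sketch (assuming plotly is installed), using R's built-in volcano dataset, an 87 x 61 matrix of elevation values:

```r
library(plotly)
# Each matrix cell becomes a point on a zoomable, rotatable 3D surface
plot_ly(z = ~volcano) %>% add_surface()
```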
Once you get around to working with your own data, you’ll need to import it into R to be able to make plots based on it. There are a number of ways of doing that.
This is probably the most common use case. If your data is in an Excel file format (.xls, .xlsx), you are better off saving it as a plain text file (although there are packages to import directly from these formats, as well as from SPSS .sav files). The commands for that are read.table(), read.csv() and read.delim(). They basically all do the same thing, but differ in their default settings.
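A minimal self-contained sketch (using a temporary file so it runs anywhere; with your own data you would pass the path to your file instead):

```r
# Write a tiny example table to a temporary file, then read it back in
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(word = c("kilt", "nation"), count = c(3, 7)), tmp, row.names = FALSE)
mydata = read.csv(tmp, stringsAsFactors = FALSE)
mydata  # a data frame with 2 rows and 2 columns
```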
There is a simple way to import data from the clipboard. While importing from files is generally a better idea (you can always re-run the code and it will find the data itself), sometimes this is handy, like quickly grabbing a little piece of table from Excel. It differs between OSes:
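A hedged sketch of the usual OS-specific idioms (shown as comments, since the clipboard contents are unknown at run time; this assumes a tab-separated table with a header row, e.g. copied from Excel):

```r
# On Windows, the clipboard can be read like a file:
# x = read.table("clipboard", sep = "\t", header = TRUE)
# On a Mac, pipe in the output of the pbpaste command instead:
# x = read.table(pipe("pbpaste"), sep = "\t", header = TRUE)
```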
For text, the readLines() command usually works well. Its output is a character vector: if the text file has 10 lines, then readLines() produces a vector of length 10, where each line is an element in that vector (you could use strsplit() to further split it into words). If the text is organized neatly in columns, however, you might still consider read.table(), but probably with the stringsAsFactors=FALSE parameter (this avoids turning long text strings into factors; read up on it if needed).
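A self-contained sketch using a temporary file (with a real corpus you would point readLines() at your own file):

```r
# Write two lines of text to a temporary file and read them back in
tmp = tempfile(fileext = ".txt")
writeLines(c("hello brave world", "goodbye moon"), tmp)
txt = readLines(tmp)
length(txt)          # 2: one element per line of the file
strsplit(txt, " ")   # a list: each line split into its words
```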
RStudio has handy options to export plots - click on Export on top of the plot panel, and choose the output format. Plots can be exported using R code as well - this is in fact a better approach, since otherwise you would have to click through the Export menus again every time you change your plot and need to re-export. Look into the help files of the jpeg and pdf functions to see how this works.
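A minimal sketch of code-based export (writing to a temporary file here; in practice you would give the plot a proper file name and path):

```r
# Open a PDF graphics device, draw a plot into it, then close the device.
# Nothing appears on screen; the plot goes straight into the file.
tmp = tempfile(fileext = ".pdf")
pdf(tmp, width = 5, height = 4)
plot(1:10, main = "Exported from code")
dev.off()         # closing the device finalizes and writes the file
file.exists(tmp)  # TRUE
```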
There are a number of ways for creating animated plots in R and making nice GIFs that you can use in a talk, on your website or wherever. There is the animation package, and plotly supports animations; or on a Mac you can use ImageMagick’s Terminal commands to convert any plot files into a GIF (you can send commands to Mac’s Terminal using the system() command; learn about loops to easily generate a number of plots with only a few lines of code).
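As a minimal sketch of the loop idea, this generates a numbered sequence of plot files (PDF here for portability) that a tool like ImageMagick could then assemble into a GIF:

```r
# Generate three frames of a "growing" line plot as separate files
outdir = tempdir()
for (i in 1:3) {
  pdf(file.path(outdir, sprintf("frame%02d.pdf", i)))
  plot(1:10, (1:10) * i, type = "l", ylim = c(0, 30),
       main = paste("Frame", i))
  dev.off()
}
list.files(outdir, pattern = "^frame")  # the three frame files
```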
There are also packages to import (and manipulate) images, GIS map data, relational databases, data from all sorts of other file formats (like XML, HTML, Google Sheets) and many more. Just google around a bit and you’ll surely find what you need.
Social networks
The following example will look into plotting social networks of who knows who.
Let’s try something else. Using the same graph data, we’ll recreate the plot using another package, visNetwork, which makes the graphs interactive.